It has been seen from our applications of cellcycleR to single cell level data that the method has a tendency to pick up those genes as influential ones which have very high expression patterns under some cell states. Then the ordering is carried out based on the expression levels of these genes and the method tries to arrange the cells in the order based on the peaks of their expressions at these cell states (clumps these cell states together). As a result of this, the method is unable to detect true sinusoidal patterns in the data.
Consider the following example
library(cellcycleR)
library(wavethresh)
## Loading required package: MASS
## WaveThresh: R wavelet software, release 4.6.6, installed
##
## Copyright Guy Nason and others 1993-2013
##
## Note: nlevels has been renamed to nlevelsWT
G <- 100;
num_cells <- 128;
amp_genes <- rep(1, G);
phi_genes <- rep(c(2,5), each=G/2);
sigma_genes <- rchisq(G, 1);
cell_times_sim <- seq(0,2*pi, length.out=num_cells);
cycle_data <- sim_sinusoidal_cycle(G, amp_genes, phi_genes, sigma_genes, cell_times_sim);
celltime_levels <- 128;
Notice from the simulation design so far that we have picked most fenes to have high noise variation compared to signal variation. This is because in real cell cycle data, we do see a lot of noise variation.
Now we assume the first \(10\) genes to be ribosomal genes. We pick 10 cell states (uniformly spaced) where these ribosomal genes have higher expression.
n_states <- 10;
n_ribo <- 10;
cell_states <- sapply(1:n_states, function(s) floor((s*length(cell_times_sim))/n_states));
for(m in 1:n_states){
for(n in 1:n_ribo){
cycle_data[cell_states[m],n] <- rnorm(1,8,1);
}
}
We now observe the graphs of a ribosomal gene and a non-ribosomal gene.
plot(cycle_data[,2], type="l")
plot(cycle_data[,40], type="l")
We now reorder the data.
sample_reorder <- sample(1:num_cells,num_cells, replace=FALSE);
cell_times_reorder <- cell_times_sim[sample_reorder];
cycle_data_reorder <- cycle_data[sample_reorder,];
system.time(out_sinusoidal <- sin_cell_ordering_class(cycle_data_reorder, celltime_levels = celltime_levels, num_iter=500))
## Final loglikelihood, iter 500 : -8487.51456199
## user system elapsed
## 317.627 153.287 168.335
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.2.3
library(RColorBrewer)
radial.plot(lengths=1:length(out_sinusoidal$cell_times),radial.pos=out_sinusoidal$cell_times[order(cell_times_reorder)],
line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(out_sinusoidal$cell_times)), lwd=2)
radial.plot(lengths=1:length(cell_times_reorder),radial.pos=sort(cell_times_reorder),
line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(cell_times_reorder)), lwd=2)
The plots of estimated gene pattern and the true gene pattern.
plot(cycle_data_reorder[order(out_sinusoidal$cell_times),2], type="l")
plot(cycle_data[,2],type="l")
plot(cycle_data_reorder[order(out_sinusoidal$cell_times),30], type="l")
plot(cycle_data[,30],type="l")
plot(cycle_data_reorder[order(out_sinusoidal$cell_times),50], type="l")
plot(cycle_data[,50],type="l")
Consider the following example
library(cellcycleR)
library(wavethresh)
G <- 100;
num_cells <- 128;
amp_genes <- rep(1, G);
phi_genes <- rep(c(2,5), each=G/2);
sigma_genes <- rchisq(G, 3);
cell_times_sim <- seq(0,2*pi, length.out=num_cells);
cycle_data <- sim_sinusoidal_cycle(G, amp_genes, phi_genes, sigma_genes, cell_times_sim);
celltime_levels <- 128;
Notice from the simulation design so far that we have picked most fenes to have high noise variation compared to signal variation. This is because in real cell cycle data, we do see a lot of noise variation.
Now we assume the first \(10\) genes to be ribosomal genes. We pick 10 cell states (uniformly spaced) where these ribosomal genes have higher expression.
n_states <- 10;
n_ribo <- 10;
cell_states <- sapply(1:n_states, function(s) floor((s*length(cell_times_sim))/n_states));
for(m in 1:n_states){
for(n in 1:n_ribo){
cycle_data[cell_states[m],n] <- rnorm(1,8,1);
}
}
We now observe the graphs of a ribosomal gene and a non-ribosomal gene.
plot(cycle_data[,2], type="l")
plot(cycle_data[,40], type="l")
We now reorder the data.
sample_reorder <- sample(1:num_cells,num_cells, replace=FALSE);
cell_times_reorder <- cell_times_sim[sample_reorder];
cycle_data_reorder <- cycle_data[sample_reorder,];
system.time(out_sinusoidal <- sin_cell_ordering_class(cycle_data_reorder, celltime_levels = celltime_levels, num_iter=500))
## Final loglikelihood, iter 500 : -27495.4719848
## user system elapsed
## 326.457 170.612 179.124
library(plotrix)
library(RColorBrewer)
radial.plot(lengths=1:length(out_sinusoidal$cell_times),radial.pos=out_sinusoidal$cell_times[order(cell_times_reorder)],
line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(out_sinusoidal$cell_times)), lwd=2)
radial.plot(lengths=1:length(cell_times_reorder),radial.pos=sort(cell_times_reorder),
line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(cell_times_reorder)), lwd=2)
The plots of estimated gene pattern and the true gene pattern.
plot(cycle_data_reorder[order(out_sinusoidal$cell_times),2], type="l")
plot(cycle_data[,2],type="l")
plot(cycle_data_reorder[order(out_sinusoidal$cell_times),30], type="l")
plot(cycle_data[,30],type="l")
plot(cycle_data_reorder[order(out_sinusoidal$cell_times),50], type="l")
plot(cycle_data[,50],type="l")
It seems that cellcyleR is not doing a bad job at extracting the patterns in the data even when the ribosomal genes which result in high expression at certain specific cell states are present.